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Abstract 

Recent work has shown how information theory extends conventional full-rationality game 
theory to allow bounded rational agents. The associated mathematical framework can be used to 
solve constrained optimization problems. This is done by translating the problem into an iterated 
game, where each agent controls a different variable of the problem, so that the joint probability 
distribution across the agents' moves gives an expected value of the objective function. The 
dynamics of the agents is designed to minimize a Lagrangian function of that joint distribution. 
Here we illustrate how the updating of the Lagrange parameters in the Lagrangian is a form of 
automated annealing, which focuses the joint distribution more and more tightly about the joint 
moves that optimize the objective function. We then investigate the use of "semicoordinate" 
variable transformations. These separate the joint state of the agents from the variables of the 
optimization problem, with the two connected by an onto mapping. We present experiments 
illustrating the ability of such transformations to facilitate optimization. We focus on the special 
kind of transformation in which the statistically independent states of the agents induces a 
mixture distribution over the optimization variables. Computer experiment illustrate this for 
fc-sat constraint satisfaction problems and for unconstrained minimization of NK functions. 

Subject Classification: programming: nonlinear, algorithms, theory; probability: applications 
Area of Review: optimization 
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1 Introduction 



1.1 Distributed optimization and control with Probability Collectives 

As first described in (Wolpert 2003a, Wolpert 20046), it turns out that one can translate many of 
the concepts from statistical physics, game theory, distributed optimization and distributed control 
into one another. This translation is based on the fact that those concepts all involve distributed 
systems in which the random variables are, at any single instant, statistically independent. (What 
is coupled is instead the distributions of those variables.) Using this translation, one can transfer 
theory and techniques between those fields, creating a large common mathematics that connects 
them. This common mathematics is known as Probability Collectives (PC). Its unifying concern 
is the set of probability distributions that govern any particular distributed system, and how to 
manipulate those distributions to optimize one or more objective functions. See (Wolpert, Turner 
& Bandari 2003, Wolpert & Turner 2001) for earlier, less formal work on this topic. 

In this paper we consider the use of PC to solve constrained optimization and/or control prob- 
lems. Reflecting the focus of PC on distributed systems, its use for such problems is particularly 
appropriate when the variables in the collective are spread across many physically separated agents 
with limited inter-agent communication (e.g., in a distributed design or supply chain application, 
or distributed control) . A general advantage of PC for such problems is that since they work with 
probabilities rather than the underlying variables, they can be implemented for arbitrary types of 
the underlying variables. This same characteristic also means they provides multiple solutions, each 
of which is robust, along with sensitivity information concerning those solutions. An advantage 
particulary relevant to optimization is that the distributed PC algorithm can often be implemented 
on a parallel computer. An advantage particularly relevant to control problems is that PC algo- 
rithms can, if desired, be used without any modelling assumptions about the (stochastic) system 
being controlled. These advantages are discussed in more detail below. 

1.2 The Probability Collectives Approach 

Broadly speaking, the PC approach to optimization/control is as follows. First one maps the 
provided problem into a multi-agent collective. In the simplest version of this process one assigns 
a separate agent of the collective to determine the value of each of the variables x« G Xi in the 
problem that we control. So for example if the i'th variable can only take on a finite number 
of values, those \Xi\ possible values constitute the possible moves of the i'th agent. 1 The value 
of the joint set of n variables (agents) describing the system is then x = [xi, ■ ■ ■ , x n ] 6 X with 
X = Xi x • ■ ■ x X n ? 

Unlike many optimization methods, in PC the variables are not manipulated directly. Rather a 
probability distribution is what is manipulated. To avoid combinatorial explosions as the number 
of dimensions of X grows, we must restrict attention to a low-dimensional subset of the space of 
all probability distributions. We indicate this by writing our distributions as q G Q over X. The 
manipulation of that q proceeds through an iterative process. The ultimate goal of this process is 
to induce a distribution that is highly peaked about the x optimizing the objective function G(x), 
sometimes called the world cost or world utility function. (In this paper we only consider problems 
with a single overall objective function, and we arbitrarily choose lower values to be better, even 
when using the term "utility".) 

In the precise algorithms investigated here, at the start of any iteration a single Lagrangian 
function of q, C : Q — > R, is specified, based on G(x) and the associated constraints of the 
optimization problem. Rather than minimize the objective function over the space X, the algorithm 
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minimizes that Lagrangian over q 6 Q. This is done by direct manipulation of the components of 
q by the agents. 

After such a minimization of a Lagrangian, one modifies the Lagrangian slightly. This is done 
so that the q optimizing the new Lagrangian is more tightly concentrated about x that solve our 
optimization problem than is the current q. One then uses the current q as the starting point for 
another process of having the agents minimize a Lagrangian, this time having them work on that 
new Lagrangian. 

At the end of a sequence of such iterations one ends up with a final q. That q is then used to 
determine a final answer in X, e.g., by sampling q, evaluating its mode, evaluating its mean (if that 
is defined), etc. For a properly chosen sequence of Lagrangians and algorithm for minimizing the 
Lagrangians, this last step should, with high probability, provide the desired optimal point in X. 

For the class of Lagrangians used in this paper, the sequence of minimizations of Lagrangians is 
closely related to simulated annealing. The difference is that in simulated annealing an inefficient 
Metropolis sampling process is used to implicitly descend each iteration's Lagrangian. By explicitly 
manipulating q, PC allows for more efficient descent. 

In this paper we shall consider the case where Q is a product space, g(x) = fli^^i)- The 
associated formulation of PC is sometimes called "Product Distribution" theory. It corresponds 
to noncooperative game theory, with each qi being agent Vs "mixed strategy" (Wolpert 20046, 
Fudenberg & Tirole 1991). Our particular focus is the use of such product distributions when 
X is not the same as the ultimate space of the optimization variables, Z. In this formulation 
- a modification of what was presented above — there is an intermediate mapping from X — ► 

and the provided G is actually a function over Z, not (directly) over X. Such intermediate 
mappings are called "semicoordinate systems" , and going from one to another is a "semicoordinate 
transformation". As elaborated below, such transformations allow arbitrary coupling among the 
variables in Z while preserving many of the computational advantages of using product distributions 
over X. 

1.3 Advantages of Probability Collectives 

There are many advantages to working with distribution in Q rather than points in X. Usually 
the support of q is all of X, i.e., the q minimizing the Lagrangian lies in the interior of the unit 
simplices giving Q. Conversely, any element of X can be viewed as a probability distribution on the 
edge (a vertex) of those simplices. So working with X is a special case of working with Q, where 
one sticks to the vertices of Q. In this, optimizing over Q rather than X is analogous to interior 
point methods. Due to the breadth of the support of q, minimizing over it can also be viewed as 
a way to allow information from the value of the objective function at all x G X to be exploited 
simultaneously. 

Another advantage, alluded to above, is that by working with distributions Q rather than 
the space X, the same general PC approach can be used for essentially any A', be it continuous, 
discrete, time-extended, mixtures of these, etc. (Formally, those different spaces just correspond 
to different probability measures, as far as PC is concerned.) For expository simplicity though, 
here we will work with finite X, and therefore have probability distributions rather than density 
functions, sums rather than integrals, etc. See in particular (Bieniawski & Wolpert 2004a, Wolpert 
& Bieniawski 2004a, Wolpert 2004a, Wolpert, Strauss & Rajnayaran 2006) for analysis explicitly 
for the case of infinite X. 

Yet another advantage arises from the fact that when X is finite, q £ Q is a vector in a Euclidean 
space. Accordingly the Lagrangian we are minimizing is a real-valued function of a Euclidean 
vector. This means PC allows us to leverage the power of descent schemes for continuous spaces 
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like gradient descent or Newton's method — even if X is a categorical, finite space. So with PC, 
schemes like "gradient descent for categorical variables" are perfectly well-defined. 

While the Lagrangians can be based on prior knowledge or modelling assumptions concerning 
the problem, they need not be. Nor does optimization of a Lagrangian require control of all variables 
X (i.e., some of the variables can be noisy). This allows PC to be very broadly applicable. 

1.4 Connection with other sciences 

A more general advantage of PC is how it relates seemingly disparate disciplines to one another. In 
particular, it can be motivated by using information theory to relate bounded rational game theory 
to statistical physics (Wolpert 2003a, Wolpert 20046). This allows techniques from one field to 
be imported into the other field. For example, as illustrated below, the grand canonical ensemble 
of physics can be imported into noncooperative game theory to analyze games having stochastic 
numbers of the players of various types. 

To review, a noncooperative game consists of a sequence of stages. At the beginning of each 
stage every agent (aka "player") sets a probability distribution (its "mixed strategy") over its 
moves (Fudenberg & Tirole 1991, Aumann & Hart 1992, Basar & Olsder 1999, Fudenberg & Levine 
1998). The joint move at the stage is then formed by agents simultaneously sampling their mixed 
strategies at that stage. So the moves those agents make at any particular stage of the game 
are statistically independent and the distribution of the joint-moves at any stage is a product 
distribution — just like in PC theory. 

This does not mean that the moves of the agents across all time are statistically independent 
however. At each stage of the game each agent will set its mixed strategy based on information 
gleaned from preceding stages, information that in general will reflect the earlier moves of the other 
agents. So the agents are coupled indirectly, across time, via the updating of the {qi}f =1 at the end 
of each stage. 

Analogously, consider again the iterative PC algorithm outlined above, and in particular the 
process of optimizing the Lagrangian within some particular single iteration. Typically that process 
proceeds by successively modifying q across a sequence of timesteps. In each of those timesteps 
q(x) = YliQi( x i) is first sampled, and then it is updated based on all previous samples. So just 
like in a noncooperative game there is no direct coupling of the values of the underlying variables 
{xj}at any particular timestep (q is a product distribution). Rather just like in a noncooperative 
game, the variables are indirectly coupled, across time (i.e., across timesteps of the optimization), 
via coupling of the distributions qi(xi) at different timesteps. 

In addition, information theory can be used to show that the bounded rational equilibrium of any 
noncooperative game is the q optimizing an associated "maxent Lagrangian" C(q) (Wolpert 20046). 
(That Lagrangian is minimized by the distribution that has maximal entropy while being consistent 
with specified values of the average payoffs of the agents.) This Lagrangian turns out to be exactly 
the one that arises in the version of PC considered in this paper. So bounded rational game theory 
is an instance of PC. 

Now in statistical physics often one wishes to find the distribution out of an allowed set of 
distributions (e.g., Q) with minimal distance to a fixed target distribution p G V, the space of all 
possible distributions over X. Perhaps the most popular choice for a distance measure between 
distributions is the Kullback-Leibler (KL) distance 3 : D(q\\p) = ^ x g(x) ln(g(x)/p(x)) (Cover & 
Thomas 1991). As the KL distance is not symmetric in its arguments p and q we shall refer to 
D(q\\p) as the qp KL distance (this is also sometimes called the exclusive KL distance), and D(p\\q) 
as the pq distance (also sometimes called the inclusive KL distance). 
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Typically in physics p is given by one of the statistical "ensembles". An important exam- 
ple of such KL minimization arises with the Boltzmann distribution of the canonical ensemble: 
p(x) oc exp[— H(x)/T], where H is the "Hamiltonian" of the system. The KL distance D{q\\p) 
to the Boltzmann distribution is proportional to the Gibbs free energy of statistical physics. This 
free energy is identical to the maxent Lagrangian considered in this paper. Stated differently, if 
one solves for the distribution q from one's set that minimizes qp KL distance to the Boltzmann 
distribution, one gets the distribution from one's set having maximal entropy, subject to the con- 
straint of having a specified expected value of H. When the set of distributions one's considering 
is Q, the set of product distributions, this q minimizing qp KL distance to p is called a "mean-field 
approximation" to p. So mean-field theory is an instance of PC. 

This illustrates that bounded rational games and the mean-field approximation to Boltzmann 
distributions are essentially identical. To relate them one equates H with a common payoff function 
G. The equivalence is completed by then identifying each (independent) agent with a different one 
of the (independent) physical variables in the argument of the Hamiltonian. 4 

This connection between these fields allows us to exploit techniques from statistical physics 
in bounded rational game theory. For example, as mentioned above, rather than the canonical 
ensemble, we can apply the grand canonical ensemble to bounded rational games. This allows us 
to consider games in which the number of players of each type is stochastic (Wolpert 20046). 

1.5 The contribution of this paper 

The use of a product distribution space Q for optimization is consistent with game theory (and 
more generally multi-agent systems). Further, this choice results in a highly parallel algorithm, and 
is well-suited to problems that are inherently distributed. Nonetheless, other concerns may dictate 
different Q. In particular, in many optimization tasks we seek multiple solutions which may be far 
apart from one another. For example, in Constraint Satisfaction Problems (CSPs) (Dechter 2003), 
the goal is to identify all feasible solutions which satisfy a set of constraints, or to show that none 
exist. For small problem instances exhaustive enumeration techniques like branch-and-bound are 
typically used to identify all such feasible solutions. However, for larger problems it is desirable to 
develop local-search-based approaches which determine multiple distinct solutions in a single run. 

In cases like these, where we desire multiple distinct solutions, the use of PC with a product 
distribution is a poor choice. The problem is that if each distribution % is peaked about every value 
of Xi which occurs in at least one of the multiple solutions, then in general there will be spurious 
peaks in the product g(x) = TJ[ qi(xi), i.e., q(x) may be peaked about some x that are not solutions. 
Alternatively, if each % is peaked only about a few of the solutions, this does not provide us with 
many solutions. To address this we might descend the Lagrangian many times, beginning from 
different starting points (i.e., different initial q). However there is no guarantee that multiple runs 
will each generate different solutions. 

PC offers a simple solution to this problem that allows one to still use product distributions: 
extend the event space underlying the product distribution so that a single game provides multiple 
distinct solutions to the optimization problem. Intuitively speaking, such a transformation recasts 
the problem in terms of a "meta-game" by cloning the original game into several simultaneous 
games, with an independent set of agents for each game. A supervisory agent chooses what game 
is to be played. We then form a Lagrangian for the meta-game that is biased towards having any 
agents that control the same variable in different games have different mixed strategies from one 
another. The joint strategies for each of the separate games in the meta-game then give a set 
of multiple solutions to the original game. The supervisory agent sets the relative importance of 
which such solution is used. Since in general the resultant distribution across the variables being 
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optimized (i.e., across Z) cannot be written as a single product distribution, it provides coupling 
among those variables. 

Formally, the above process can be represented as a semi coordinate transformation. Recall 
that the space of arguments to the objective function G{z) is Z, and that the product distribu- 
tion is defined over X. A "semicoordinate system" maps from X to Z (Wolpert & Bieniawski 
2004a, Wolpert 2004c). Before introduction of the semicoordinate system X = Z, and product 
distributions over X give product distributions over Z. However when X ^ Z and we introduce 
a semicoordinate system, product distributions over X (i.e., the noncooperative game is played in 
X) need not be product distributions on Z. By appropriate choice of the semicoordinate trans- 
formation, such distributions can be made to correspond to any coupled distributions across Z. 
In general, any Bayes net topology can be achieved with an appropriate semicoordinate trans- 
formation (Wolpert 2004c, Wolpert & Bieniawski 2004a). Different product distributions over X 
correspond to different Bayes nets having the same independence relations. 

Here we consider a X that results in a mixture of M product distributions Z, 

M 



q(z) = £ g°(m)<T(z) 



m=l 



Intuitively, q° is the distribution over the moves of the supervisor agent, with m labelling the 
game that agent chooses. This mixture of product distributions allows for the determination of 
M solutions at once. At the same time, an entropy term in the Lagrangian "pushes" the separate 
products q m {z) in the mixture apart. This biases the algorithm to locating well separated solutions, 
as desired. 

In Sec. [2] we review how one arrives at the Lagrangian considered in this paper, the maxent 
Lagrangian. In Sec. [3] we review two elementary techniques introduced in (Wolpert & Bieniawski 
20046, Wolpert 2003a, Wolpert 2004a) for updating a product distribution q to minimize the as- 
sociated Lagrangian. Depending on the form of the objective, the terms involved in the updating 
of q may be evaluated in closed form, or may require estimation via Monte Carlo methods. In the 
experiments reported here all terms are calculated in closed form. However, to demonstrate the 
wider applicability of the update rules we review in Appendix [A] a set of Monte-Carlo techniques 
providing low variance estimates of required quantities. The first derivation of these estimators is 
presented in this work. 

With this background review complete, semicoordinate transformations are introduced in Sec.|4j 
As an illustration of the use of semicoordinate transformations particular attention is placed on 
mixture models, and how mixture models may be seen as a product distributions over a different 
space. In Sec{5] we analyze the minimization of the maxent Lagrangian associated with mixture- 
inducing semicoordinate transformations. In that section we also relate our maxent Lagrangian 
for mixture distributions to the Jensen-Shannon distance over X. Experimental validation of these 



techniques is then presented for the fc-satisfiability CSP problem (section 6.1) and the NK family 



of discrete optimization problems (section 6.2). These sections consider the situation where the 
semicoordinate transformation is fixed a priori, but suggestions are made on how to determine a 
good semicoordinate transformation dynamically as the algorithm progresses. We conclude with a 
synopsis of some other techniques for updating a product distribution q to minimize the associated 
Lagrangian. This synopsis serves as the basis for a discussion of the relationship between PC and 
other techniques. 

Like all of PC, the techniques presented in this paper can readily be applied to problems other 
than constrained optimization. For example, PC provides a natural improvement to the Metropolis 
sampling algorithm (Wolpert & Lee 2004), which the techniques of this paper should be able to 



6 



improve further. In addition, while for simplicity we focus here on optimization over countable 
domains, PC can be extended in many ways to continuous space optimization. The associated 
technical difficulties can all be addressed(Wolpert et al. 2006). See (Antoine, Bieniawski, Kroo 
& Wolpert 2004, Wolpert & Bieniawski 2004a, Bieniawski, Wolpert & Kroo 2004, Bieniawski k 
Wolpert 20046) for other examples of PC and experiments. 

It should be emphasized that PC usually is not a good choice for how best to optimize problems 
lying in some narrowly defined class. When one know a lot about the class of optimization problems 
under scrutiny, algorithms that are hand-tailored for that class will almost be called for. It is also 
in such situations that one often can call upon formal convergence bounds. In contrast, PC is in the 
spirit of Genetic Algorithms, the cross entropy method, simulated annealing, etc. It is a broadly 
applicable optimization algorithm that performs well in many domains, even when there is little 
prior knowledge about the domain. (See (Wolpert & Macready 1997) for a general discussion of 
this issue.) 

Finally, in statistical inference, one parameterizes the possible solutions to one's problem to 
reduce the dimensionality of the solution space. Without such parameterization, the curse of 
dimensionality prevents good performance, in general (Duda, Hart & Stork 2000). By choosing 
one's parameterization though, one assumes (implicitly or otherwise) that that parameterization is 
flexible enough to capture the salient aspects of the stochastic process generating one's data. In 
essence, one assumes that the parameterization "projects out" the noise while keeping the signal. 

The PC analogue of the problem of what parameterization to use is what precise semicoordinate 
transformation to use. Just as there is no universally correct choice of how to parameterize a 
statistics problem, there is no universally correct choice of what semicoordinate transformation 
to use. In both situations, one must rely on prior knowledge to make one's choice, potentially 
combined with conservative online adaptation of that choice. 

2 The Lagrangian for Product Distributions 

We begin by considering the case of the identity semicoordinate system, and X = Z. As dis- 
cussed above, we consider qp distance to the T-parameterized Boltzmann distribution p(x) = 
exp[— G(x)/T]/Z(T) where Z(T) is the normalization constant. At low T the Boltzmann distribu- 
tion is concentrated on x having low G values, so that the product distribution with minimal qp 
distance to it would be expected to have the same behavior. Accordingly, one would expect that 
by taking qp KL distance to this distribution as one's Lagrangian, and modifying the Lagrangian 
from one iteration to the next by lowering T, one should end up at a q concentrated on x having 
low G values. (See (Wolpert & Bieniawski 20046, Wolpert 2004a) for a more detailed formal justifi- 
cation of using this Lagrangian based on solving constrained optimization problems with Lagrange 
parameters.) 

More precisely, the qp KL distance to the Boltzmann distribution is the maxent Lagrangian, 

C(q)=E q (G)-TS(q) (1) 

up to irrelevant additive and multiplicative constants. Equivalently, we can write it as 

C{q) = f3E g (G) - S(q) (2) 

where (5 = 1/T, up to an irrelevant overall constant. In these equations the inner product K q (G) = 
^ x g(x)G(x) is the expected value of G under q, and S(q) = — ^ x <j(x) In q(x) is the Shannon 
entropy of q. 
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For q's which are product distributions S(q) = ^2 i S(qi) where S(qi) = —J2 x Qi( x i)^ n Qi( x i)- 
Accordingly, we can view the maxent Lagrangian as equivalent to a set of Lagrangians, Ci(q) = 
^2 Xi M q _ i [G(x i ,-x.-i)]q i (xi)—TSi(qi), one such Lagrangian for each agent i so that £(q) = Y%=i A(<z)« 5 
The first term in C is minimized by having perfectly rational players, i.e. by players who concen- 
trate all their probability on the moves that are best for them, given the distributions over the 
agents. The second term is minimized by perfectly irrational players, i.e., by a perfectly uniform 
joint mixed strategy q. So T specifies the balance between the rational and irrational behavior of 
the players. In particular, for T — > 0, by minimizing the Lagrangian we recover the Nash equilibria 
of the game. Alternatively, from a statistical physics perspective, where T is the temperature of 
the system, this maxent Lagrangian is simply the Gibbs free energy for the Hamiltonian G. 

2.1 Incorporating constraints 

Since we are interested in problems with constraints, we replace G in Eqs. ([IJ and Q with 

c 

G(x) + ^A a c a (x) (3) 

a=l 

where G is the original objective function and the c a are the set of C equality constraint functions 
that are required to be equal to zero. For constraint satisfaction problems we take the original 
objective function to be the constant function 0. The A a are Lagrange multipliers that are used to 
enforce the constraints. Collectively, we refer to the Lagrange multipliers with the Cxi vector A. 
The constraints are all equality constraints, so a saddle point of the Lagrangian over the space of 
possible q and A is a solution of our problem. Note however that we do not have to find the exact 
saddle point; in general sampling from a q close to the saddle point will give us the x's we seek. 

There are certainly other ways in which constraints can be addressed within the PC framework. 
An alternative approach might allow constraints to be weakly violated. We would then iteratively 
anneal down those weaknesses, i.e., strengthen the constraints, to where they are not violated. In 
this approach we could replace the maxent Lagrangian formulation encapsulated in Eq.'s ([2| and 
<j3j) with 

C(q, A A) = m q (G) - 7g] +^A a [E 9 ( Co ) - 7a ] - S(q). (4) 

a 

In each iteration of the algorithm (3, A are treated as Lagrange parameters and one solves for their 
values that enforce the equality constraints E g (G) = 7g, and the C constraints E g (c a ) = 7 a while 
also minimizing C(q,f3,X). In the usual way, since our constraints are all equalities, one can do 
this by finding saddle points of C(q,(3,X). The next iteration would then start by modifying our 
Lagrangian by shrinking the values 7g, {"fa} slightly before proceeding to a new process of finding 
a saddle point. 

Another more theoretically justified way to incorporate constraints requires that the support 
of q is constrained to lie entirely within the feasible region. Any x which violates a constraint is 
assigned probability, i.e. q(x) = at all x which violate the constraints. 

For pedagogical simplicity, we do not consider these alternative approaches, but concentrate on 
the Lagrangian of Eq. ([I]) with the G of Eq. ([3]). In addition to the constraints associated with 
the optimization problem the vectors {%} must be probability distributions. So there are implicit 
constraints our solution must satisfy: < qi(xt) < 1 for all i and X{, and ^ qi{xi) = 1 for all i. 
To reduce the size of our equations we do not explicitly write these constraints. 
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3 Minimizing the maxent Lagrangian 



For fixed (3, our task is to find a saddle point of C(q, A). In "first order methods" a saddle point 
is found by iterating a two-step process. In the first step the Lagrange parameters A are fixed and 
one solves for the q that minimizes the associated £. 6 In the second step one then freezes that q 
and updates the Lagrange parameters. There are more sophisticated ways of finding saddle points 
(Grantham 2005), and more generally one can use modified versions of the Lagrangian (e.g., an 
augmented Lagrangian (Bertsekas 1996)). For simplicity we do not consider such more sophisticated 
approaches. 

In this section we review two approaches to finding the {q{\ for fixed Lagrange multipliers A. 
We also describe our approach for the second step of the first order method, i.e., we describe how we 
use gradient ascent to update the Lagrange multipliers A a for fixed q. See (Wolpert 2003a, Wolpert 
& Bieniawski 20046, Wolpert 2004a) for further discussion of these approaches as well as the many 
others one can use. 



3.1 Brouwer Updating 

At each step t the direction in the simplex Q that, to first order, maximizes the drop in C is given 
by (-1 times) 

V q £(q)±V q £(q) - V (q). (5) 

In this equation f]{q) is proportional to the unit vector, with its magnitude set to ensure that a 
step in the direction V q C{q) remains in the unit simplex. Furthermore, the qi{xi) component of the 
gradient, one for every agent i and every possible move x% by the agent, is (up to constant terms 
which have been absorbed into r](q)): 

dL 

[V q C(q)] qi{xi) = ^y-r = E q _lG\xi) + Tlnfe^)] (6) 

where 

Vi^l^i) = 2 q-i(x-i)G(x i} x_<) (7) 

with x_j = [xi,--- ,Xi-i,Xi+i,"- ,x n ] and quj(x_j) = XYj=i\jjki<lj( x j)- v(o) is tne vector that 
needs to be added to S7 q C(q) so that each qi is properly normalized. 7 The qi(xi) component of 
77(a), is equal to 

fofaJWi) = mil) - ryr Yl [ V s £ (3)W') ( 8 ) 

x i 

where \Xi\ is the number of possible moves (allowed values) X{. Not that for any agent i, all of the 
associated components of T](q), namely qi(x%), • • • , qi(x\x.\), share the same value rji(q). This choice 
ensures that ^2 X , qi(xi) = 1 after the gradient update to the values qi(xi). 

The expression in Eq. ([7| is the expected payoff to agent i when it plays move Xj, under the 
distribution a_j across the moves of all other agents. Setting V q C(q) to zero gives the solution 

ql +1 ( Xl ) oc exp[-E qli (G\xi)/T] (9) 

Brouwer 's fixed point theorem guarantees the solution of Eq. ([9]) exists for any G (Wolpert 20046, 
Wolpert 2003a). Hence we call update rules based on this equation Brouwer updating. 

Brouwer updating can be done in parallel on all the agents. However, one problem that can 
arise if all agents update in parallel is "thrashing" . In Eq. ^ each agent i adopts the qi that is 
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be optimal assuming the other agents don't change their distributions. However, other agents do 
change their distributions, and thereby at least partially confound agent i. One way to address this 
problem is to have agent i not use the current value K q t [G\xi) alone to update qj(xi), but rather 
use a weighted average of all values E t > (G\xi) for t' < t, with the weights shrinking the further into 

the past one goes. This introduces an inertia effect which helps to stabilize the updating. (Indeed, 
in the continuum-time limit, this weighting becomes the replicator dynamics (Wolpert 2004c).) 

A similar idea is to have agent i use the current K q t (G\xi) alone, but have it only move part of 
the way the parallel Brouwer update recommends. Whether one moves all the way or only part- way, 
what agent i is interested in is what distribution will be optimal for the next distributions of the 
other agents. Accordingly, it makes sense to have agent i predict, using standard time-series tools, 
what those future distributions will be. This amounts to predicting what the next vector of values 
of K q t (G\xi) will be, based on seeing how that vector has evolved in the recent past. See (Shamma 
& Arslan 2004) for related ideas. 

Another way of circumventing thrashing is to have the agents update their distributions serially 
(one after the other) rather than in parallel. See (Wolpert & Bieniawski 2004a) for a description of 
various kinds of serial schemes, as well as a discussion of partial serial, partial parallel algorithms. 

3.2 Nearest-Newton Updating 

To evaluate the gradient one only needs to evaluate or estimate the terms K q t (G\xi) for all agents 
(see below and (Wolpert 2003a, Wolpert 20046)). Consequently, gradient descent is typically 
straight-forward. Though, it is also usually simple to evaluate the Hessian of the Lagrangian, 
conventional Newton's descent is intractable for large systems because inverting the Hessian is 
computationally expensive. 

Of course there are schemes such as conjugate gradient or quasi-Newton that do exploit second 
order information even when the Hessian cannot be inverted. However, the special structure of the 
Lagrangian also allows second order information to be used for a simple variant of Newton descent. 
The associated update rule is called Nearest- Newton updating (Wolpert & Bieniawski 20046); we 
review it here. 

To derive Nearest-Newton we begin by considering the Lagrangian E 7r (G) — TS(ir), for an 
unrestricted probability distribution 7r. 8 This Lagrangian is a convex function of ir with a diagonal 
Hessian. So given a current distribution ir 1 we can make an unrestricted Newton step of this 
Lagrangian to a new distribution ir t+l . That new distribution typically is not in Q (i.e. not a 
product distribution), even if the starting distribution is. However we can solve for the q t+l G Q 
that is nearest to 7r* +1 , for example by finding the q t+1 G Q that minimizes qp KL distance D(p\\q) 
to that new point. 

More precisely, the Hessian of K n (G) — TS(ir), 3 2 £/97r(x)c?7r(x / ), is diagonal, and so is simply 
inverted. This gives the Newton update for ir 1 : 

vr t+1 (x) =vr*(x) -cv*7r*(x) 

which is normalized if 7r* is normalized and where a q is a step size. As 7r* will typically not belong 
to Q we find the product distribution nearest to ir t+1 by minimizing the KL distance D(7r* +1 ||g) 
with respect to q. The result is that qi(xi) = 7r' +1 (irj), i.e. qi is the marginal of ir t+1 given by 
integrating it over x_j. 



G (x)-M G) + ^, ) + W(x) 



10 



Thus, whenever 7r* itself is a product distribution, the update rule for qi(xi) is 

This update maintains the normalization of but may make one or more g* +1 (xj) greater than 
1 or less than 0. In such cases we set to be valid product distribution nearest in Euclidean 
distance (rather than KL distance) to the suggested Newton update. 



+ S( gi )+ In qj 



(10) 



3.3 Updating Lagrange Multipliers 

In order to satisfy the imposed optimization constraints {c a (x)} we must also update the Lagrange 
multipliers. To minimize communication between agents this is done in the simplest possible way 
- by gradient descent. Taking the partial derivatives with respect to A a gives the update rule 

Ai +1 = A^ + a!E^(c a (x)) (11) 

where a\ is a step size and g* is the local minimizer of C determined as above at the old settings, 
A*, of the multipliers. 



3.4 Other descent schemes 

It should be emphasized that PC encompasses many approaches to optimization of the Lagrangian 
that differ from those used here. For example, in (Bieniawski & Wolpert 2004a, Wolpert 2004c) 
there is discussion of alternative types of descent algorithms that are related to block relaxation, 
as well as to the fictitious play algorithm of game theory (Fudenberg & Tirole 1991, Shamma & 
Arslan 2004) and multi-agent reinforcement learning algorithms like those in collective intelligence 
(Wolpert & Turner 2001, Wolpert et al. 2003). 

As another example, see (Wolpert & Bieniawski 20046, Wolpert 20046) for discussions of using 
pq KL distance (i.e., D{p\\q)) rather than qp distance. Interestingly, as discussed below, that 
alternative distance must be used even for descent of qp distance, if one wishes to use 2nd order 
descent schemes. (Wolpert 2004a) discusses using non-Boltzmann target distributions p, and many 
other options for what functional(s) to descend. 



3.5 Algorithmic summary 

Having described two possible PC algorithms we summarize the steps involved in each. This basic 
framework will form the basis for the semicoordinate extensions described in Section HI Pseudocode 
for basic PC optimization appears in Algorithm [TJ Lines 1-3 initialize algorithmic parameters. 
The best starting temperatures and multiplier values vary from problem to problem. We typically 
initialize q to be the maximum entropy distribution which is uniform over the search space X. An 
outer loop decreases the temperature according to a a schedule determined by the function updateT. 
Later we comment on automatic schedules generated from the settings of Lagrange multipliers. 



Inside this loop is another loop which increments Lagrange multipliers according to Eq. (11) 
every time q is iterated to a local minimum of the Lagrangian. The minimization of L for a fixed 
temperature and setting of multipliers is accomplished in the innermost loop. This minimization is 
accomplished by repeatedly determining all the conditional probabilities lE ? _ i (G ! +^ a A a c a |xj) (line 



7), and then using these in either of the two update rules Eqs. (|9j), (10) (line 8). The evaluation 
of the conditional expectations of G + ^ a A a c a can be accomplished either analytically or with 
Monte Carlo estimates. For many problems analytical evaluation is prohibitively costly, and Monte 
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Input: An objective function G(x), and a set of equality constraint functions {c a (x) = 0}^ =1 
Output: A product distribution (?*(x) = Y\i1i( x i) peaked about the minimizer of G 
satisfying the constraints. 

1 A<- initializeM ultipliers 

2 T <— initializeT 

3 ?f- initia lizeQ 

4 repeat 

5 repeat 

6 repeat 

7 gs <— evalConditionalExpectations (G,q) 

8 q <— updateQ (gs) 

9 until atLocalMinimum(g) 
10 A <— updateA 

n until constraintsSatisfied 

12 T <— updateT 

13 until done 

Algorithm 1: The basic PC algorithmic framework. 

Carlo methods are the only option. In Appendix A we consider unbiased low variance Monte Carlo 
methods for estimating the required conditional expectations, and in Appendix B we derive the 
minimal variance estimator from within a class of useful estimators. 

4 Semicoordinate Transformations 
4.1 Motivation 

Consider a multi-stage game like chess, with the stages (i.e., the instants at which one of the players 
makes a move) delineated by t. In game theoretic terms, the "strategy" of a player is the mapping 
from board-configuration to response that specifies the rule it adopts before play starts (Fudenberg 
& Tirole 1991, Basar & Olsder 1999, Osborne & Rubenstein 1994, Aumann & Hart 1992, Fudenberg 
&; Levine 1998). More generally, in a multi-stage game like chess the strategy of player i, Xi, is the 
set of t-indexed maps taking what that player has observed in the stages t' < t into its move at 
stage t. Formally, this set of maps is called player i's normal form strategy. 

The joint strategy of the two players in chess sets their joint move-sequence, though in gen- 
eral the reverse need not be true. In addition, one can always find a joint strategy to result in 
any particular joint move-sequence. Now typically at any stage there is overlap in what the play- 
ers have observed over the preceding stages. This means that even if the players' strategies are 
statistically independent (being separately set before play started), their move sequences are statis- 
tically coupled. In such a situation, by parameterizing the space Z of joint-move-sequences z with 
joint-strategies x, we shift our focus from the coupled distribution P(z) to the decoupled product 
distribution, (?(x). This is the advantage of casting multi-stage games in terms of normal form 
strategies. 

More generally, given any two spaces X and Z, any associated onto mapping ( : Z — > X, 
not necessarily invertible, is called a semicoordinate system. The identity mapping id : Z — > Z 
is a trivial example of a semicoordinate system. Another semicoordinate system is the mapping 
from joint-strategies in a multi-stage game to joint move-sequences. In other words, changing the 
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representation space of a multi-stage game from move-sequences z to strategies x is a semicoordinate 
transformation of that game. 

Intuitively, a semicoordinate transformation is a reparameterization of how a game — a mapping 
from joint moves to associated payoffs — is represented. So we can perform a semicoordinate 
transformation even in a single-stage game. Say we restrict attention to distributions over X that 
are product distributions. Then changing £(■) from the identity map to some other function means 
that the players' moves are no longer independent. After the transformation their move choices 
- the components of z — are statistically coupled, even though we are considering a product 
distribution. 

Formally, this is expressed via the standard rule for transforming probabilities, 

P z (z G Z) 4 C(P X ) = J dx P x (x)6(z - C(x)), (12) 

where Px and Pz are the distributions across X and Z, respectively. To see what this rule means 
geometrically, recall that V is the space of all distributions (product or otherwise) over Z and that 
Q is the space of all product distributions over X. Let C(Q) be the image of Q in V. Then by 
changing £(■), we change that image; different choices of £(■) will result in different manifolds C(2)- 

As an example, say we have two players, with two possible moves each. So z consists of the 
possible joint moves, labelled (0, 0), (0, 1), (1, 0) and (1,1). Take X = Z, and choose C(0,0) = 
(0,0), C(0,1) = (1,1), C(1,0) = (1,0), and C(M) = (0,1). Say that q is given by qi ( Xl = 
0) = 92(^2 = 0) = 2/3. Then the distribution over joint-moves z is Pg(0,0) = Px(0,0) = 4/9, 
Pz(l,0) = Pz(l,l) = 2/9, Pz(0,l) = 1/9. So P z {z) / P z (z 1 )P z (z 2 ); the moves of the players are 
statistically coupled, even though their strategies X{ are independent. 

Any P z , no matter what the coupling among its components, can be expressed as ((Px) f° r 
some product distribution Px for and associated £(•) In the worst case, one can simply choose X to 
have a single component, with £(•) a bijection between that component and the vector z — trivially, 
any distribution over such an X is a product distribution. Another simple example is where one 
aggregates one or more agents into a new single agent, i.e., replaces the product distribution over 
the joint moves of those agents with an arbitrary distribution over their joint moves. This is related 
to the concept coalitions in cooperative game theory, as well as to Aumann's correlated equilibrium 
(Fudenberg & Tirole 1991, Aumann 1987, Aumann & Hart 1992). 

Less trivially, given any model class of distributions {Pz}, there is an X and associated £(■) such 
that {Pz} is identical to Q(Qx)- Formally this is expressed in a result concerning Bayes nets. For 
simplicity, restrict attention to finite Z. Order the components of Z from 1 to N. For each index 
i € {1,2,..., N}, have the parent function Pa(i, z) fix a subset of the components of z with index 
greater than i, returning the value of those components for the z in its second argument if that 
subset of components is non-empty. So for example, with N > 5, we could have Pa(l, z) = (z 2 , £5). 
Another possibility is that Pa(l,z) is the empty set, independent of z. 

Let A(Pa) be the set of all probability distributions Pz that obey the conditional independencies 
implied by Pa: V P z £ A(Pa),z G Z, 

N 

P z (z) = HPz(zi\P a (i,z)). (13) 

i=l 

By definition, if Pa(i,z)) is empty, Pz{zi\ Pa(i, z)) is just the i'th marginal of Pz, Pz{ z i)- As 
an example of these definitions, the dependencies {Pa(l,z) = [zi, £3), Pa(2, z) = Z4,Pa(3,z) = 
(),Pa(4,z) = ()} correspond to the family of distributions factoring as 

P(z) = P( Zl \z 2 ,z 3 )P(z 2 \z 4 )P(z 3 )P(z 4 ) 
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As proven in (Wolpert & Bieniawski 2004a), for any choice of Pa there is an associated set of 
distributions ((Qx) that equals ^4(Pa) exactly: 

Proposition: Define the components of X using multiple indices: For all i £ {1,2, . . . , N} and 
possible associated values (as one varies over z G Z) of the vector Pa(i,z), there is a separate 
component of x, Xj ; p a (i, z )- This component can take on any of the values that Z{ can. Define £(•) 
recursively, starting at i = N and working to lower i, by the following rule: V i E {1, 2, . . . , N}, 

[C(x)]j = x i;Pa(iiZ) . 

Then A(Pa) = C(Qx)- 

Intuitively, each component of x in the proposition is the conditional distribution Pz{z%\ Pa(i, z)) 
for some particular instance of the vector Pa(i,z). As illustration consider again the example 
{Pa(l,z) = (z2, Z3), Pa(2, z) = Z4,Pa(3,z) = (),Pa(4,z) = ()}. If each z% assumes the value or 
1, then x has 8 components £4, 2%, X2-o, X2 ; i, £i ; oo> a?i;Qi 3 £i ; i0) and xi ; n with each component also 
either or 1. The product distribution in is 

g(x) = O4(x4)g3(x3)92;0(x2;0)g2;l(x2;l)oi;00(a;i;00)g , l;0l(xi;0l)q , l;10(a;i;10)9l;ll(a;i;ll)- 

Under £ the distribution 04(2:4) is mapped to 04(2:4), «2;o(£2 ; o) is mapped to <72(22|z4 = 0), gi ; oi(2 ; i;Oi) 
is mapped to qi(z\\Z2 = 0, Z3 = 1), and so on. 

The proposition means that in principle we never need consider coupled distributions. It suffices 
to restrict attention to product distributions, so long as we use an appropriate semicoordinate 
system. Semicoordinate systems also enable the representation of mixture models over Z can 
be also be represented using products. However, before discussing mixture models we show how 
transformation of semicoordinate systems can in principle be used to escape local minima in L{q). 



4.2 Semicoordinate transformations and local minima 

To illustrate another application of semicoordinate transformations, we confine ourselves to the 
case where X = Z so that £ is a bijection on X. 

We assume that the domain of the ith of n variables has size \Xi\. Then \X\ = YIa=i 1^1 ^ s the 
size of the search space. Each coordinate variable Xi partitions the search space into \Xi\ disjoint 
regions. The partitions are such that the intersection over all variable coordinates yields a single x. 
In particular, the standard semicoordinate system relies on the partition [*, • • • , *,Xi = 0,*, ■ ■ ■ ,*], 
• • • , [*, • • • , *, Xi = \Xi\ — 1, *, • • ■ , *1 for each coordinate x,-. 



1 'a) shows 
1 'b) shows 



As a illustrative example, consider 3 binary variables where X = {0, l} 3 . Figure 
the 8 points in the search space represented in the standard coordinate system. Figure 

a shuffling of the 8 configurations under the permutation (0123456 7) — > (15264073). The 
resulting partitions of configurations are given in Table [TJ 

Such transformations can be used to escape from local minima of the Lagrangian. To see this 
consider a coordinate transformation £ from X to the new space Z such that z = C( x )) an d choose 
g(z) = g(x) (i.e. do not change the associated probabilities). Then the entropy contribution to the 
Lagrangian remains unchanged, but the expected G alters from ^ x G(x)g(x) to 9 

^G^(x)g(x) 4£ G (C(x))g(x) = £ G(C(z))g(z). 

XX z 

This means that the gradient of the maxent Lagrangian will typically differ before and after the ap- 
plication of In particular, what was a local minimum with zero gradient before the semicoordinate 
transformation may not be a local minimum after the transformation and the resultant shuffling 
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X\ Zl 




^3 Z 3 

(a) (b) 

Figure 1: (a) Original linear indexing for 3 binary variables x\,X2,x%. (b) Result after applying 
the transformation to the new variables z\, Z2, z%. 



xi = 

X\ = 1 


(0,0,0), (0,0,1), (0,1,0), (0,1,1) 
(1,0,0), (1,0,1), (1,1,0), (1,1,1) 


zi = 
Zl = 1 


(1,0,0), (0,1,0), (0,0,1), (1,1,1) 
(0,0,0), (1,1,0), (1,0,1), (0,1,1) 


X 2 = 
X 2 = 1 


(0,0,0), (0,0,1), (1,0,0), (1,0,1) 
(0,1,0), (0,1,1), (1,1,0), (1,1,1) 


z 2 = 

Z 2 = 1 


(1,0,0), (0,1,0), (1,0,1), (0,1,1) 
(0,0,0), (1,1,0), (0,0,1), (1,1,1) 


X 3 = 
x 3 = 1 


(0,0,0), (0,1,0), (1,0,0), (1,1,0) 
(0,0,1), (0,1,1), (1,0,1), (1,1,1) 


Z3 = 
^3 = 1 


(1,0,0), (0,1,0), (1,0,1), (0,1,1) 
(0,0,0), (1,1,0), (0,0,1), (1,1,1) 



Table 1 : Resultant partitions from the transformation of Figure [T 



of utility values. As difficult problems typically have many local minima in their Lagrangian, such 
semicoordinate transformations may prove very useful. 

A simple example is shown in [2](a) where a Lagrangian surface for 2 binary variables is shown. 
The utility values are G(0, 0) = 0, G(l, 0) = 25, G(0, 1) = 18, 1) = 2. If the temperature is 7 in 
units of the objective then the global minimum is at </i(0) = 0.95,^(0) = 0.91 where L = —0.82. 
At this temperature there is a suboptimal local minimum (indicated by the dot in the lower left) 
located at gj(0) = 0.14,^(0) = 0.08 where L = 0.83. 

There are a number of criteria that might be used to determine a semicoordinate transforma- 
tion to escape from this local minimum q* . Two simple choices are to select the transformation 
that minimizes the new value of the maxent Lagrangian (i.e., minimize L(q*)), or to select the 
transformation which results in the largest gradient of the maxent Lagrangian at q* , (i,e„ maxi- 
mize || ^ q L{q*) ||). For this simple problem the results of both these choices are shown as Figures 
|2|b) and [2](c) respectively. The transformation in each of these cases is determined by optimizing 
over semicoordinate transformations (permutations), while keeping the probabilities fixed, to either 
minimize the Lagrangian value at q* , or maximize the norm of the gradient at q*. 

In principle, this semicoordinate search can be embedded within any optimization to dynam- 
ically escape local minima as they are encountered. Importantly, the search criteria listed above 
require no look ahead in order to identify the best semicoordinate permutation. However, in prac- 
tice, since we can not search over arbitrarily large permutation spaces we must select a few of 
the variables and permute amongst their possible joint moves. 10 By composing such permutations 
we can easily account for the escape from multiple local minima. Heuristics for the selection of 
"permuting" variables, and the results of this procedure await future work. 
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Figure 2: (a) Original Lagrangian function. The suboptimal local minimum is located at q* which 
is indicated with a dot in the lower left corner, (b) The Lagrangian under the coordinate trans- 
formation which minimizes L{q*). (c) The Lagrangian under the coordinate transformation which 
maximizes the norm of the gradient at q* . The direction of the negative gradient is indicated by a 
black arrow. 
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4.3 Semicoordinate Transformations for Mixture Distributions 



In this section we turn to a different use of semicoordinate transformations. We have previously 
described how the Lagrangian measuring the distance of a product distribution to a Boltzmann 
distribution may be minimized in a distributed fashion. We now extend these results to mixtures 
of product distributions in order to represent multiple product distribution solutions at once. We 
can always do that by means of a semicoordinate transformation of the underlying variables. In 
this section we demonstrate this explicitly. 

Let X indicate the set of variables in a space of dimension dx, and let Z be the original (pre- 
transformation) n-dimensional space over which G is defined. We identify a product distribution 
over X (with dx > n), and an appropriately chosen mapping £ : X — > Z which induce a mixture 
distribution over Z. 

To see this consider an M component mixture distribution over n variables, which we write as: 

Q( z ) = Ef=i 9°(™)9 m (z) with Ylm=i Q°( rn ) = 1 an d q m {z) = YYi=i QTi^)- We express this q(z) 
as (the image of) a product distribution over a space X of dimension dx = 1 + Mn. Intuitively, 
the first dimension of X (indicated as x° £ [1,M]) labels the mixture, and the remaining Mn 
dimensions (indicated as x™ £ Zi) correspond to each of the original n dimensions for each of the 
M mixtures. 

More precisely, write out the X -space product distribution as qxfe) = q°(x°) Y\m=i <Z m ( xTra ) with 
q m (* m ) = njLi q™{ x ?) for x = [A* 1 , • • • ,x M ] and x m = [x^ 1 , ■ ■ ■ , x™\. The density in X and Z 
are related as usual by q(z) = ^ x g^(x)<5(z — C( x )) with the delta function acting on vectors being 
understood component- wise. If we label the components of ( so that Zi = d(x°, x 1 , ■ ■ ■ , x M ) = xf 
we find 

,(z)=j>v) £ n« m ( xm )n%-&(*v\ ■■■,*")) 

x° x 1 ,— ,x M m i 

x° x 1 ,- ,x M m i 

x° x x° i 

= E9°(x°)/(z) 

x° 

Thus, under ( the product distribution qx is mapped to the mixture of products q(z) = J2 m q (m)q m (^) 
(after relabelling x° to m), as desired. 

The maxent Lagrangian of the X product distribution qx(x) is 

M 



£x{qx) ^3°(m)V(G) - T ^(g ) + E 



m=l 

This Lagrangian contains a term pushing us (as we search for the minimizer of that Lagrangian) to 
maximize the entropy of the mixture weights, but it provides no incentive for the distributions q m 
to differ from each other. As we have argued, it is desirable to have the different mixtures capture 
different solutions, and so we modify the Lagrangian function slightly. 

If we wish the q m differ from one another, we instead consider the maxent Lagrangian defined 
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over q{z). In this case 



Cz(q) = E G ( Z M Z ) ~ TS ^) = E G(C(x))^(x) - TS(q) 

Z X 



The entropy term differs crucially in these two maxent Lagrangians. To see this add and subtract 
^ Em q°(m)S(q m ) to the Z Lagrangian to find 

£ 2 (g) = ^ ? (m)£(g m )-TJ( 9 ) (14) 

rn 

where each C{q m ) is a maxent Lagrangian as given by Eq. ([I]), and J(q) > is a modified version 
of the Jensen-Shannon (JS) distance, 

J(q) = S(E - E 9°(m)5(g m ) = - E E 1°(m)q m (z) ln 



<?" l (zj 



Conventional Jensen-Shannon distance is defined to compare two distributions to each other, and 
gives those distributions equal weight. In contrast, the generalized JS distance J(q) concerns 
multiple distributions, and weights them nonuniformly, according to q°(m). 

J(q) is maximized when the q m are all different from each other. Thus, its inclusion into the 
Lagrangian pushes the mixing components away (in X) from one another. In this, we can view 



Eq. ( 14 ) as a novel derivation of (a generalized version of) Jensen-Shannon distance. Unfortunately, 
the JS distance also couples all of the variables (because of the sum inside the logarithm) which 
prevents a highly distributed solution. 

To address this, in this paper we replace J{q) in Cz{q) with another function which lower- 
bounds J{q) but which requires less communication between agents. It is this modified Lagrangian 
that we will minimize. 

4.4 A Variational Lagrangian 

Following (Jaakkola & Jordan 1998), we introduce M variational functions w{z\m) and lower-bound 
the true JS distance. We begin with the identity 



J ^) = -EE^V)^)in 



1 -g°(m^^' m ^^ 



w(z\m) q°(m)q m (z) 
q°(m)q m (z) lnit;(z|m)) — q°(m) ln g°(m) 

m z m 

sr^ST^ Or \ mi \^ w(z\m)q(z) 
- 2^2^g u (m)g m (z)ln- 



(m)q m (z) 

Now replace M of the — ln terms with the lower bound — ln z > —vz + ln v + 1 obtained from the 
Legendre dual of the logarithm to find 

J(q) > J{q, w, v) — q°(m)q m (z) ln-u;(z|m) — q°(m) ln q°(m) 

m z m 

- S ^v m S ^w(z\m)q{z) + E'A™) lnv m + 1. 
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Optimization over w and v maximizes this lower bound. To further aid in distributing the algorithm 
we restrict the class of variational w{z\m) to products: w(z\m) = Y\ i Wi(zi\m). For this choice 



J(q, w,v)±Y, 1°( m ) I Bm ' m ~ Yl + ln u ™ \ + S(q°) + 1 

m \ fh ) 



(15) 



where Af' m = Y<z l ( lT( z i) w i( z i\ m )i Am,m ~ \[i=x A T' m ^ B T' m ~ Yj Zi QT&)^ w i&\ m )i and 
B m ' m = Yli=i B" l,m . ll At any temperature T the variational Lagrangian which must be mini- 
mized with respect to q, w and v (subject to appropriate positivity and normalization constraints) 
is then 

Cz(q, w,u)=Y J q°(m)C(q m ) - TJ(q, w, v) (16) 



with J(q,w,u) given by Eq. (15) 



5 Minimizing the Mixture Distribution Lagrangian 

Equating the gradients with respect to w and v to zero gives 



Wi{zi\m) oc 



Vra 

q°(m)qr(z l ) 



V, 



III 



Xy(m)^(*) 



A 



m,m 



(17) 
(18) 



The dependence of Cz on q°(m) is particularly simple: Cz(q, w, v) J2 m q°{m)£{'m')—T(S(q ) + l) 
up to (^-independent terms where 

£(m) 4 E gm (G) - T [S[q m ] + B m ' m -^A™^ + hxv m ] , 



Thus, the mixture weights are Boltzmann distributed with energy function £{m): 

exp(— £ (m)/T) 



q (m) 



Em ex P(-^(^)/ T ) 



(19) 



The determination of q™(zi) is similar. The relevant terms in Cz involving q™{zi) are Cz ~ 
q°(m)Z Zl £m(zi)qr(zi) ~TS(q™) where 

( v A 771 '" 1 \ 

£ m {zi) = E ? m (G|zj) - Tf Interim) - 2_, ^ v^Wi{zi\m)\. 

As before, the conditional expectation E g m. ((?|^i) is ^ 2 . G(zi, z-^q^ (z_j). The mixture proba- 
bilities are thus determined as 



exp(-f m (^)/T) 

E^ ex P(-^"m(^)/ T ) 



(20) 
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5.1 Agent Communication 



As desired these results require minimal communication between agents. A supervisory agent, call 
this the 0-agent, is assigned to manage the determination of q°(m), and (z,m)-agents manage the 
determination of qf l (zi). The M (i, m)-agents for a fixed % communicate their Wi(zi\m) to determine 
A ' . These results along with the B™' m from each (i, m) agent are then forwarded to the 0- agent 
who forms A m ' m and B m ' m broadcasts this back to all (i, m)-agents. With these quantities and the 
local estimates for K g m,(G\zi), all q™ can be updated independently. 



6 Experiments 

In this section we demonstrate our methods on some simple problems. These examples are illustra- 
tive only, and for further examples the reader is directed to (Bieniawski & Wolpert 2004 a, Bieniawski 
et al. 2004, Bieniawski & Wolpert 20046, Lee & Wolpert 2004, Wolpert & Lee 2004) for related 
experiments. 

We test the the mixture semicoordinate probability collective method on two different problems: 
a fc-sat constraint satisfaction problem having multiple feasible solutions, and optimization of an 
unconstrained optimization of an NK function. 



6.1 k-sat 

The fc-sat problem is perhaps the best studied CSP (Mezard, Parisi & Zecchina 2002). The goal 
is to assign ./V binary variables (labelled z,j) true/valse values so that C disjunctive clauses are 
satisfied. The ath clause involves k variables labelled by v a j € [1,N] (for j £ [l,k]), and k binary 

j. The ath clause is satisfied iff \/ k - =l [z Va j = a a j] 



values associated with each a and labelled by a a j. The ath clause is satisfied iff \/ k =1 [z 



so we define the ath constraint as 

c a (z) 



ifV*=iKi 

1 otherwise 



As the ath clause is violated only when all z Va . = a a j (with a defined to be not a), the Lagrangian 
over product distributions can be written as C(q) = A T c(g) — TS(q) where c(q) is the C-vector 
of expected constraint violations, and A is the C vector of Lagrange multipliers. The a'th compo- 
nent of c(q) is the expected violation of the ath clause, and is given by c a (q) = ^2 z c a (z)q(z) = 
Ili=i Qvajivaj)- Note that no Monte Carlo estimates are required to evaluate this quantity. Fur- 
ther, the only communication required to evaluate G, and its conditional expectations is between 
agents appearing in the same clause. Typically, this communication network is sparse; for the 
N = 100, k = 3, C = 430 variable problem we consider here each agent interacts with only 6 other 
agents on average. 

We first present results for a single product distribution. For any fixed setting of the Lagrange 



multipliers, the Lagrangian is minimized by iterating Eq. (10). If the minimization is done by the 
Brouwer method, any random subset of variables, no two of which appear in the same clause, can 
be updated simultaneously. This eliminates "thrashing" and ensures that the Lagrangian decreases 
at each iteration. 

The minimization is terminated at a local minimum q* which is detected when the norm of 
the gradient falls below a threshold. If all constraints are satisfied at q* we return the solution 



z* = arg max z q*(z), otherwise the Lagrange multipliers are updated according to Eq. (11). In 



the present context, this updating rule offers a number of benefits. Firstly, those constraints which 
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Figure 3: (a) Evolution of Lagrangian value (solid line), expected constraint violation (dotted line), 
and constraint violations of most likely configuration (dashed line), (b) P[G) after minimizing the 
Lagrangian for the first 3 multiplier settings. At termination P(G) = 5(G). 




Figure 4: Each constraint's Lagrange multiplier versus the iterations when they change. 
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are violated most strongly have their penalty increased the most, and consequently, the agents 
involved in those constraints are most likely to alter their state. Secondly, the Lagrange multipliers 
contain a history of the constraint violations (since we keep adding to A) so that when the agents 
coordinate on their next move they are unlikely to return a previously violated state. This mimics 
the approach used in taboo search where revisiting of configurations is explicitly prevented, and aids 
in an efficient exploration of the search space. Lastly, rescaling the Lagrangian after each update of 
the multipliers by 1 T A = ^ a A a gives C(q) = X T c(q) — TS(q) where A = A/1 T A and T = T/1 T A. 
Since ^ a A a = 1 the first term reweights clauses according to their expected violation, while the 
temperature T cools in an automated annealing schedule as the Lagrange multipliers increase. 
Cooling is most rapid when the expected constraint violation is large and slows as the optimum is 
approached. The parameters a\ thus govern the overall rate of cooling. We used the fixed value 
a\ = 0.5 in the reported experiments. 

Figure [3] presents results for a 100 variable k = 3 problem using a single mixture. The problem is 
satisfiable formula uf 100-01 . cnf from SATLIB (www. satlib. org|. It was generated with the ratio 
of clauses to variables being near the phase transition, and consequently has few solutions. Fig. [3]^a) 
shows the variation of the Lagrangian, the expected number of constraint violations, and the number 
of constraints violated in the most probable state z mp = arg max z q(z) as a function of the number of 
iterations. The starting state is the maximum entropy configuration of uniform qi, and the starting 
temperature is T = 1.5 ■ 10 -3 . The iterations at which the Lagrange multipliers are updated are 
indicated by vertical dashed lines which are clearly visible as discontinuities in the Lagrangian 
values. To show the stochastic underpinnings of the algorithm we plot in Fig. (3^b) the probability 
density of the number of constraint violations obtained as P(G) = ^ z q(z)5(G — Y^ a c a( z ))- 12 We 
see the downward movement of the expected value of G as the algorithm progresses. Figure [4] shows 
the evolution of the renormalized Langrange multipliers A. At the first iteration the multiplier for 
all clauses are equal. As the algorithm progresses weight is shifted amongst difficult to satisfy 
clauses. 

Results on a larger problem with multiple mixtures are shown in Fig. [5ja). This is the 250 
variable/1065 clause problem uf 250-01 . cnf from SATLIB with the first 50 clauses removed so that 
the problem has multiple solutions. The optimization is performed by selecting a random subset 
of variables, no two of which appear in the same clause, at each iteration, and updating according 
to Eqs. (17), (18), (19), and (20). After convergence to a local minimum the Lagrange multipliers 
are updated as above according to the expected constraint violation. The initial temperature is 
10 _1 . We plot the number of constraints violated in the most probable state of each mixture as a 
function of the number of updates, as well as the expected number of violated constraints. After 
8000 steps three distinct solutions are found along with a fourth configuration which violates a 
single constraint. 



6.2 Minimization of NK Functions 

Next we consider an unconstrained discrete optimization problem. The NK model defines a family 
of tunably difficult optimization problems (Kauffman & Levin 1987). The objective over N binary 
variables {zi]f =l is defined as the average of N randomly generated contributions depending on 
Z{ and K other randomly chosen variables zj ■ ■ ■ zf: G(z) = iV -1 YliLi Ei( z fi z h ' ' ' z f)- For each 
of the 2 K+1 local configurations E{ is assigned a value drawn uniformly from to 1. K which 
ranges between and N — 1 controls the number of local minima; under Hamming neighborhoods 
K = optimization landscapes have a single global optimum and K = N — 1 landscapes have 
on average 2 N /(N + 1) local minima. Further properties of NK landscapes may be found in 
(Durrett &: Limic 2003). Fig. ^tb) plots the energy of a 5 mixture model for a multi-modal 
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Figure 5: (a) The solid curves show the number of unsatisfied clauses in the most probable config- 
uration z mp of each of the 4 mixtures vs iterations. The topmost solid black line plots the expected 
number of violations, and the dashed black line shows the approximation to the JS distance, (b) 
The solid curves show the evolution of the G value of the best z mp configurations for each of 5 mix- 
tures versus number of iterations. The dashed black line shows the corresponding approximation 
to the JS distance. 



N = 300 K = 2 function. The K — 1 spins other than i upon which E{ depends were selected at 
random. At termination of the PC algorithm (at a small but non-zero temperature), five distinct 
configurations are obtained with the nearest pair of solutions having Hamming distance 12. Note 
that unlike the fe-sat problem which has multiple configurations all having the same global minimal 
energy, the JS distance (the dashed curve) of Fig. |5jb) drops to zero as the temperature decreases. 
This is because at exactly zero temperature there is no term forcing different solutions, and the 
Lagrangian is minimized by having all mixtures converge to delta functions at the lowest objective 
value configuration. The effect of mixtures enables the algorithm to simultaneously explore multiple 
local externa before converging on the lowest objective solution. 



7 Relation of PC to other work 

There has been much work from many fields related to PC. The maxent Lagrangian has been used 
in statistical physics for over a century under the rubric of "free energy" . Its derivation in terms of 
information theoretic statistical inference was by Jaynes (Jaynes 1957). The maxent Lagrangian has 
also appeared occasionally in game theory as a heuristic, without a statistical inference justification 
(be it information-theoretic or otherwise) (Fudenberg & Levine 1998, Shamma & Arslan 2004). 13 
In none of this earlier work is there an appreciation for its relationship with the related work in 
other fields. 

In the context of distributed control/optimization, the distinguishing feature of PC is that it 
does not view the variable x as the fundamental object, but rather the distribution across it, q. 
Samples of that distribution are not the direct object of interest, and in fact are only used if 
necessary to estimate functionals of q. The fundamental objective function is stated in terms of q. 
As explicated in the next subsection, the associated optimization algorithms are related to some 
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work in several fields. Heretofore those fields have been unaware of each other, and of the breadth 
of their relation to information theory and game theory. 

Finally, we note that the maxent or qp Lagrangian C(q) = E q [G) — TS(q) can be viewed as 
a barrier-function (interior point) method with objective E q {G). An entropic barrier function is 
used to enforce the constraints qi(xi) > Vi and Xj, with the constraint that all qi sum to 1 being 
implicit. 

7.1 Various schemes for updating q 

We have seen that the qp Lagrangian is minimized by the product distribution q given by 



Direct application of these equations that minimize the Lagrangian form the basis of the Brouwer 
update rules. Alternatively, steepest descent of the maxent Lagrangian forms the basis of the 
Nearest Newton algorithm. These update rules have analogues in conventional (non-PC) opti- 
mization. For example, Nearest-Newton is based on Newton's method, and Brouwer updating is 
similar to block-relaxation. This is one of the advantages of embedding the original optimization 
problem involving x in a problem involving distributions across x: it allows us to solve problems 
over non-Euclidean (e.g., countable) spaces using the powerful methods already well-understood 
for optimization over Euclidean spaces. 

However there are other PC update rules that have no direct analogue in such well-understood 
methods for Euclidean space optimization. Algorithms may be developed that minimize the pq 
Lagrangian D(pr\\q) where pr(x) = exp(— G(x)/T)/Z(T) with Z(T) being the normalization of 
the Boltzmann distribution. The pq Lagrangian is minimized by the the product of the marginals of 
the Boltzmann distribution, i.e. <fe(xj) = J cZx_j py(x). Another example of update rules without 
Euclidean analogues are the iterative focusing update rules described in (Wolpert 2004a). Iterative 
focusing updates are intrinsically tied into the fact that we're minimizing (the distribution setting) 
an expectation value. 

A subset of update rules arising from qp and pq Lagrangians are described in (Wolpert 2004a). 
In all cases, the updates may be written as multiplicative updating of q. The following is a list of 
the update ratios r q ^{xi) = q 1 ^ 1 (xi) / q\(xi) of some of those rules. 

In all of the rules in this list, Fq is a probability distribution over x that never increases between 
two x's if G does (e.g., a Boltzmann distribution in G{x)). In addition, const is a scalar that ensures 
the new distribution is properly normalized and a is a stepsize. 14 Finally, "iterative focusing" is a 
technique that repeats the following process: It takes a distribution q as input. It then produces as 
output a new distribution that is q "focused" more tightly about x's with good G(x). That focused 
distribution becomes q for the next iteration. See (Wolpert et al. 2006). 

Gradient descent of qp distance to Fq: 



qi(xi) oc exp(-£ , (? _ i (G|xj 



)/T). 



(21) 




(22) 



Nearest Newton descent of qp distance to Fq: 



1 -a{K qt (In F G \ Xi ) +ln(q t i (x i 



))} 



const 



(23) 



Brouwer updating for qp distance to Fq: 



e E qt (In F G \xi) 



(24) 



const x 
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Importance sampling minimization of pq distance to Fq: 

const x E q t(^-\ Xl ^j (25) 

Iterative focusing of q with focusing function Fq using qp distance and gradient descent: 

t _ E ,(lnF c |x)^ ln |(S> 

I i q\xi) 

Iterative focusing of q with focusing function Fq using qp distance and Nearest Newton: 

l-a\E q t(\nF G \ Xi ) + ln^^|- const (27) 

Iterative focusing of q with focusing function Fq using qp distance and Brouwer updating: 

const x e M ln ^) x S^L (28) 

Iterative focusing of q with focusing function Fq using pq distance: 

const x Eq(F G \xi) x (29) 

Note that some of these update ratios are themselves proper probability distributions, e.g., the 
Nearest Newton update ratio. 

This list highlights the ability to go beyond conventional Euclidean optimization update rules, 
and is an advantage of embedding the original optimization problem in a problem over a space of 
probability distributions. Another advantage is the fact that the distribution itself provides much 
useful information (e.g., sensitivity information). Yet another advantage is the natural use of Monte 
Carlo techniques that arise with the embedding, and allow the optimization to be used for adaptive 
control. 



7.1.1 Relation of PC to other distribution-based optimization algorithms 

There is some work on optimization that precedes PC and that has directly considered the dis- 
tribution q as the object of interest. Much of this work can be viewed as special cases of PC. 
In particular deterministic annealing (Duda et al. 2000) is "bare-bones" parallel Brouwer updat- 
ing. This involves no data-aging (or any other scheme to avoid thrashing of the agents), difference 
utilities, etc.. 15 

More tantalizingly, the technique of probability matching (Sabes & Jordan 1996) uses Monte 
Carlo sampling to optimize a functional of q. This work was in the context of a single agent, and 
did not exploit techniques like data-ageing. Unfortunately, this work was not pursued. 

Other work has both viewed q as the fundamental object of interest and used techniques like 
data-aging and difference utilities. In particular, this is the case with the Collective INtelligence 
(COIN) work (Wolpert, Turner & Frank 1999, Wolpert et al. 2003, Wolpert 2003c). However this 
work was not based on information-theoretic considerations and had no explicit objective function 
for q. It was the introduction of such considerations that resulted in PC. 
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Another interesting body of early work is the cross entryop (CE) method (Rubinstein & Kroese 
2004). This work is the same as iterative focusing using pq distance. The CE method does not 
consider the formal difficulty with iterative focusing identified in (Wolpert et al. 2006), or any of 
the potential solutions to that problem discussed there. That difficulty means that even with no 
sampling error (i.e., if all estimated quantities are in fact exactly correct), in general there are no 
guarantees that the algorithm converges to an optimal x. 

Other early work grew out of the Genetic Algorithms community. This work was initiated 
with MIMIC (De Bonet, Isbell Jr. & Viola 1997), and has since developed into the Estimation of 
Distribution Algorithms (EDA) approach (Lozano, Larrahaga, Inza & Bengoetxa 2006). A number 
of important issues were raised in this early work, e.g., the importance of information theoretic 
concepts. 

For the most part however, especially in its early stages (e.g., with MIMIC), this work viewed 
the samples as the fundamental object of interest, rather than view the distribution being sampled 
that way. Little concern arises for what the objective function of the distribution being sampled 
should be, and of how samples can be used to achieve that optimization. 

This means that there is little concern for issues like the (lack of) convexity of that implicit 
objective function. This prevents EDA's from fully exploiting the power of continuous space op- 
timization, e.g., the absence of local minima with convex problems (Boyd & Vandenberghe 2003). 
Similarly, it means that with EDA's there is no concern for cases where the distribution objec- 
tive function can be optimized in closed form, without any need for sampling at all. Nor is there 
widespread appreciation for how old sample x's can be re-used to help guide the optimization (as 
for example they are in PC's adaptive importance sampling (Wolpert et al. 2006)). 

This contrasts with PC, whose distinguishing feature is that it does not treat the variable x as 
the fundamental object to be optimized, but rather the distribution across it, q. So for example, 
in PC, samples of that distribution are only used if necessary to estimate quantities that cannot 
be evaluated other ways; the fundamental objective function is stated in terms of q. Indeed, in 
the cases considered in this paper, as well as earlier PC work like that reported in (Macready & 
Wolpert 2004), no such samples of q arise. 

Shortly after the introduction of PC, a variant of its Monte Carlo version of parallel Brouwer 
updating has been introduced, called the MCE method (Rubenstein 2005). In this variant the 
annealing of the Lagrangian doesn't involve changing the temperature (3, but instead changing the 
value of a constraint specifying K q (G). Accordingly, rather than jump directly to the (/^-specified) 
solution given above, one has to solve a set of coupled nonlinear equations relating all the 
(Another distinguishing feature is no data-ageing, difference utilities or the like are used in the 
MCE method.) The MCE method has been justified with the KL argument reviewed above rather 
than with "ratchet" -based maximum entropy arguments. This has redrawn attention to the role 
of the argument-ordering of the KL distance, and how it relates Brouwer updating and the CE 
method. 

Another body of work related to PC is (loopy) propagation algorithms, Bethe approximations, 
and the like (Mackay 2003). These techniques can be seen as an alternative to semicoordinate 
transformations for how to go beyond product distributions. Unlike those approaches, we are 
guaranteed to reach a local minimum of free energy. (If we were to use pq KL distance rather than 
qp KL distance, we would get to a global minimum of free energy.) In addition, via utilities like AU 
and WLU (see appendix), we can exploit variance reduction techniques absent from those other 
techniques. Similarly those other techniques do not make use of data-aging. 

Finally, there is also work that has been done after the introduction of both the CE method 
and PC that is closely related to both. Such methods are typically sample based, and a good 
summary of these approaches can be found in (Miihlenbein & Hons 2005). The use of junction tree 



26 



factorizations for q utilized in the above methods can be extended to PC theory, and results along 
these lines will be presented elsewhere. 

8 Conclusion 

A distributed constrained optimization framework based on probability collectives has been pre- 
sented. Motivation for the framework was drawn from an extension of full-rationality game theory 
to bounded rational agents. An algorithm that is capable of obtaining one or more solutions simul- 
taneously was developed and demonstrated on two problems. The results show a promising, highly 
distributed, off-the-shelf approach to constrained optimization. 

There are many avenues for future exploration. Alternatives to the Lagrange multiplier method 
used here can be developed for constraint satisfaction problems. By viewing the constraints as 
separate objectives, a Pareto-like optimization procedure may be developed whereby a gradient 
direction is chosen which is constrained so that no constraints are worsened. This idea is motivated 
by the highly successful WalkSAT (Selman, Kautz & Cohen 1996) algorithm for fc-sat in which 
spins are flipped only if no previously satisfied clause becomes unsatisfied as a result of the change. 

Probability collectives also offer promise in devising new methods for escaping local minima. 
Unlike traditional optimization methods where monotonic transformations of the objective leave lo- 
cal minima unchanged, such transformations will alter the local minima structure of the Lagrangian. 
This observation, and alternative Lagrangians (see (Rubinstein 2001) for a related approach using 
a different minimization criterion) offer new approaches for improved optimization. 

ACKNOWLEDGEMENTS We would like to thank Charlie Strauss for illuminating conversa- 
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Notes 

1 \S\ denotes the number of elements in the set 5. 

2 In this paper vectors are indicated in bold font and scalars are in regular font. 

3 Despite its popularity, the KL distance D(q\\p) between two probability distributions q and p is not a proper 
metric. It is not even symmetric between q and p. However it is non-negative, and equals zero only when q = p. 

4 Here and throughout, we fix the convention that it is desirable to minimize objective functions, not to maximize 
them. 

5 We adopt the notation that q-i indicates the distribution q with variable i marginalized out, i.e., the product 
Hj^iU- Analogously, x_i = [xi,--- , Xi-i, x i+ i, ■ ■ ■ ,x n ] 

6 Properly speaking one should find the global minimizer q. Here we content ourselves with finding local minima. 

7 N.b., we do not project onto Q but rather add a vector to get back to it. See (Wolpert & Bieniawski 2004b). 
For such a distribution we relax the requirement of being a product or having any other particular form; we only 
require that all < 7r(x) < 1 and Yl x ^(x) — 1- 

9 The Jacobian factor is irrelevant as £ is a permutation. 

10 Alternatively, we can parameterize a smaller space of candidate permutations and select the best from amongst 
this candidate set. 

11 Note that if Wi(zi\m) = 1/|-Z;| is uniform across «j then A™' m = \j\Zi\ and B™' m — — \a.\Zi\. Maximizing over 
v m we find that J(q, w = 1/\Z\, v = v*) — 0. Thus, maximizing with respect to w increases the JS distance from 0. 

12 In determining the density 10 4 samples were drawn from q(z) with Gaussians centered at each value of G(z, 1) 
and with the width of all Gaussians determined by cross validation of the log likelihood. The fact that there is 
non-zero probability of obtaining non-integral numbers of constraint violations is an artifact of the finite width of the 
Gaussians. 

13 However an attempt at a first-principles derivation can be found in (Meginniss 1976). 

14 As a practical matter, both Nearest Newton and gradient-based updating have to be modified in a particular 
step if their step size is large enough so that they would otherwise take one off the unit simplex. This changes the 
update ratio for that step. See (Wolpert & Bieniawski 20046). 
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15 Indeed, as conventionally cast, deterministic annealing assumes the conditional G can be evaluated in closed 
form, and therefore has no concern for Monte Carlo sampling issues. 
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A Statistical estimation to update q 

Using either of the update rules Eqs. ([9| or ( p7o| ) requires knowing K q t (G\xi), defined in Eq. Q. 
However, often we cannot efficiently calculate all the terms E^t (G\xi). To use our update rules in 
such situations we can use Monte Carlo sampling, as described in this section. 
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A.l Monte Carlo sampling 



In the Monte Carlo approach, at each timestep every agent i samples its distribution g$ to get a 
point Xi. Since we have a product distribution q, these samples provides us with a sample x of the 
full joint distribution q. By repeating this process L times we get a "block" of such joint samples. 
The G values in that block can be used by each agent separately to estimate its updating values 
E ? t (G\xi), for example simply by uniform averaging of the G values in the samples associated with 
each Xi. Note that the single set of samples can be used no matter how many agents are in the 
system; we don't need a different Monte Carlo process for each agent to estimate their separate 
E qt (G\ Xi ). 

All agents (variables) sample moves (variable settings) independently, and coupling occurs only 
in the updates of the qi. As we have seen this update (even to second order) for agent i depends 
only on the conditional expectations K q _ i (G\xi) where q-i describes the strategies used by the 
other agents. Thus, if we are using Monte Carlo, then the only information which needs to be 
communicated to each agent is the G values upon which the estimate will be based. Using these 
values each agent independently updates its strategy (its qi) in a way which collectively is guaranteed 
to lower the Lagrangian. 

If the expectation is evaluated analytically, the ith agent needs the qj distributions for each of 
the j agents involved in factors in G that also involve i. For objective functions which consists of 
a sum of local interactions each of which individually involves only a small subset of the variables 
(e.g. the problems considered here), the number of agents that i needs to communicate with is 
typically much smaller than n. 



A. 2 Difference utilities for faster Monte Carlo convergence 

The basic Monte Carlo approach outlined above can be slow to converge in high- dimensional prob- 
lems. For the problems considered in this paper this is irrelevant, since E t (G\xi) may be efficiently 
calculated in closed form for all agents i and their moves Xi, so we don't need to use Monte Carlo 
sampling. For completeness though here we present details of an improvement of the basic Monte 
Carlo approach that converges far more quickly. 



Scrutiny of the two update rules f\9h and (10) reveals that we don't actually require the values 



E g t (G\xi) for all X{. All that is needed are the differences E q t (G\xi)—K q t (G|x£) for pairs of distinct 
moves Xi and x\. (For both update rules the other degrees of freedom in the values K q t (G\xi) are 
absorbed into the factor in the update rule ensuring normalization of qi.) In other words, to update 
qi we can replace the sample averages of G(x) with the sample averages of the associated difference 
utility g*(x) = G(x) — D J (x_j) for any function D*(x_j). The differences in expectation values 
E ? * (g l \xi) — E ? t (g l |a^) are invariant with respect to choice of D l . However the choice of D % can 
have a major effect on the statistical accuracy of our Monte Carlo estimation. 

We exploit the freedom granted by the introduction of D*(x_j) to derive quantities which may 
be more easily estimated by Monte Carlo methods. In particular, we may define D l by requiring 
that the maximum likelihood estimator of the gradient has the lowest possible variance. In appendix 
IB] we show that the choice 



O'(x-i) 



Y.— 

^ Li~.ii 



r V G ^' X -l (30) 



^ L 



gives the lowest possible variance. In the above equation L Xi is the numbers of times that the ith 
variable assumed value Xi in a block of L Monte Carlo samples. 
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The associated difference utility, g l (x.) = G(x) — Z)*(x_j), is called the Aristocrat utility (AU). 
An approximation to it was investigated in (Wolpert & Turner 2001, Wolpert, Turner & Bandari 
2004, Wolpert & Turner 2002) and references therein. AU itself was derived in (Wolpert 20036). 
AU minimizes variance of gradient descent updating regardless of the form of q. 

Sometimes not all the terms in the sum in AU can be stored, because \Xi\ and/or the block size 
is too large. That sum can be approximated by replacing the L Xi values in the definition of AU 
with q(xi)L. This replacement also provides a way to address cases where one or more L Xi = 0. 16 
Similarly, for computational reasons it may be desirable to approximate the weighted average of G 
over all x\ which defines AU. 

The sum over x[ occurring in AU should not be confused with the sum over x\ in Eq. (|8]) 
that pulls the gradient estimate back into the unit simplex. The sum here is over values of G for 
counterf actual sample pairs (a^,x_j). (The other sum is over values of our gradient estimate at 
all of its arguments.) When the functional form of G is known it is often the case that there is 
cancellation which allows AU be calculated directly, in one evaluation, an evaluation which can be 
cheaper than that of G(x). When this is not the case their evaluation incurs a computational cost 
in general. 17 This cost is offset by the fact that those evaluations allow us to determine the value 
of AU not just for the actual point (xj,x_j), but in fact for all points {(a^,x_j)|a^ £ X{\. 

Nonetheless, there will be cases where evaluating AU requires evaluating all possible G(a^,x_j), 
and where the cost of that is prohibitive, even if it allows us to update AU for all Xi at once. 
Fortunately there are difference utilities that are cheaper to evaluate than AU while still having 
less variance than G. In particular, note that the weighting factor L~} / Ylx" ^V' m ^ e f° rm ula for 
AU is largest for those Xi which occur infrequently, i.e. that have low qi(xi). This observation leads 
to the Wonderful Life Utility (WLU), which is an approximation to AU that (being a difference 
utility) also has zero bias: 

gf LU (x u ^ t ) = Gixi,*^) - G(xf mp ,^i). (31) 

In this formula, x^ lamp = arg min x . L Xi or if we wish to be more conservative, arg mm x .qi(xi), 
agent i's lowest probability move (Wolpert 2003a, Wolpert 20046). 18 

Given the derivation of AU, one would expect that in general WLU will work best if xf amp is 
set to the x that is least likely. As the run unfolds, which x has that property will change. However 
changing x^ lamp accordingly will void the assumptions underlying the derivation of AU. In practice 
then, while one usually can update xf amp in such a dynamic fashion occasionally, doing so too 
frequently can cause problems. 

A. 3 Discussion of Monte Carlo sampling 

Note that the foregoing analysis breaks down if any of the L Xi = 0. More generally it may break 
down if just one or more of the q(xj) are particularly small in comparison to the others, even if 
no L Xi is exactly zero. The reason for this is that our approximation of the average over nf (see 
Appendix [B]) with the average where no L Xi = breaks down. Doing the exact calculation with no 
such approximation doesn't fix the situation — once we have to assign non-infinitesimal probability 
to L Xi = 0, we're allowing a situation in which the gradient step would take us off the simplex of 
allowed q £ Q. We might try to compensate for this by reducing the stepsize, but in general 
the foregoing analysis doesn't hold if stepsize is reduced in some situations but not in any others. 
(Variable stepsize constitutes a change to the update rule. Such a modification to the update rule 
must be incorporated into the analysis — which obviates the derivation of AU.) 
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One way to address this scenario would be to simply zero out the probability of agent i making 
any move x% for which q%{xi) is particular small. In other words, we can redefine z's move space to 
exclude any moves if their probability ever gets sufficiently small. This has the additional advantage 
of reducing the amount of "noise" that agents j ^ i will see in the next Monte Carlo block, since 
the effect of agent i on the value of G in that block is more tightly constrained. 

There several ways to extend the derivation of AU, which only addresses estimation error for 
a single agent at a time, and for just that agent's current update. One such extension is to have 
agent i's utility set to improve the accuracy of the update estimation for agents j / i. For example, 
we could try to bias qi to be peaked about only a few moves, thereby reducing the amount of noise 
those other agents j^i will see in the next Monte Carlo block due to variability in i's move choice. 
Another extension is to have agent i's utility set to improve the accuracy of its estimate of its 
update for future Monte Carlo blocks, even at the expense of accuracy for the current block. 

Strictly speaking, the derivation of AU only applies to gradient descent updating of q. Difference 
utilities are unbiased estimators for the Nearest Newton update rule, so long as each i estimates 
E g t((f) as qj(xi) times the estimate of ¥, q t(g l \xi), 

Xi 

rather than as the empirical average over all samples of g 1 , 19 

X{ 

However the calculation for how to minimize the variance must be modified. Redoing the algebra 
above, the analog of AU for the Nearest Newton rule arises if we replace 




throughout the equation defining AU. Similar considerations apply to Brouwer updating as well. 
Nonetheless, in practice AU and WLU as defined above work well (and in particular far better than 
taking g l = G) for the other updating rules as well. 

For gradient descent updating, minimizing expected quadratic error of our estimator of E t (G\xi) 
corresponds to making a quadratic approximation to the Lagrangian surface, and then minimizing 
the expected value of the Lagrangian after the gradient step (Wolpert 2003a). More generally, 
and especially for other update rules, some other kind of error measure might be preferable. Such 
measures would differ from the bias- variance decomposition. We do not consider such alternatives 
here. 

Note that the agents are completely "blind" in the Monte Carlo process outline above, getting 
no information from other agents other than the values of G(x). When we allow some information 
to be transmitted between the agents we can improve the estimation of ¥, q t (G\xi) beyond that 
of the simple Monte Carlo process outlined above. For example, say that at every timestep the 
agent i knows not just its own move Xi, but in fact the joint move x. Then as time goes on it 
accumulates a training set of pairs {(x, G(x))}. These can be used with conventional supervised 
learning algorithms (Duda et al. 2000) to form a rough estimate, G, of the entire function G. Say 
that in addition i knows not its own distribution Oj(x'), but in fact the entire joint distribution, 
g(x*). Then it can use that joint distribution together with G to form an estimate of E q t (G\xi). 
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That estimate is in addition to the one formed by the blind Monte Carlo process outlined above. 
One can then combine these estimates to form one superior to both. See (Lee & Wolpert 2004). 

Even when we are restricted to a blind Monte Carlo process, there are many heuristics that 
when incorporated into the update rules that can greatly improve their performance on real- world 
problems. In this paper we examine problems for which joint distributions q are known to all agents 
as well as the function form of Ghc) and the required expectations E t (G\xi) may be obtained in 
closed form. So there is no need for Monte Carlo approximations. Accordingly there is no need for 
those heuristics, and there is not even any need to using difference utilities. Empirical investigations 
of the effects of using difference utility functions and the heuristics may be found in (Bieniawski 
& Wolpert 2004a, Bieniawski et al. 2004, Bieniawski & Wolpert 20046). 



B Utility Derivation 

Say we are at a timestep t at the end of a Monte Carlo block, and consider the simplest updating 
rule. This is gradient descent updating, in which we wish to update qi at a timestep t by having 
each agent i take a small step in the direction (cf. Eq. (p|) 



m,G A 



dC(q*) dC{q l 



'%(xf d ) 



where rji(q) was defined in Eq. (|8]), 1 is the vector of length \X{\ all of whose components are 1, 
and xj,--- ,x^^ are the \Xi\ moves available to agent i. In general, there will be some error in 
i's estimate of that step, since it has limited information about glj. Presuming quadratic loss 
reflects quality of the update, for agent i the Bayes-optimal estimate of its update is the posterior 
expectation 

J dq l P(qU | m) f*' G 

where rtj is all the prior knowledge and data that i has, and the dependence of P ,G on q t _ i is 
implicit. 20 P(q^ i \rii) is a probability distribution over likely values of q t _ i given the information 
available to agent i. 

Now agent i can evaluate lngj(xj) for each of its moves X{ exactly. However to perform its 
update it still needs the integrals 



J dq l P((ti\ni)KfJG\xi) 



(recall Eq. Q). In general these integrals can be very difficult to evaluate. As an alternative, we 
can replace those integrals with simple maximum likelihood estimators of them, i.e., we can use 
Monte Carlo sampling. In this case, the prior information, rij, available to the agent is a list, £, of 
L joint configurations x along with their accompanying objective values G(x). 

To define this precisely, for any function h(x), let h(n,) be a vector of length \Xi\ which is 
indexed by x%. The x% component of h(rij) is indicated as h Xi {m). Each of its components is given 
by the information in rij. The Xj'th such component is the empirical average of the values that h 
had in the L Xi samples from the just-completed Monte Carlo block when agent i made move Xi, 
i.e. 

h Xi {rii) = — h(x) 
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where £ Xi is the set of x in £ whose ith component is equal to Xi, and where L Xi = \£ Xi \- Given 
this notation, we can express the components of the gradient update step for agent i under the 
simple maximum likelihood estimator as the values 

fifim) = -{G Xi ( ni ) + T In gi ( Xi )} - fj(m) (33) 

where 

Vi(m) = ^^{GAn,) +Tlnq l (x' t )}. (34) 
1 il < 

Unfortunately, often in very large systems the convergence of G{rii) with growing L is very 
slow, since the distribution sampled by the Monte Carlo process to produce rii is very broad. This 
suggests we use some alternative estimator. Here we focus on estimators that are still maximum 
likelihood, just with a different choice of utility. In particular, we will focus on estimators based on 
difference utilities, briefly introduced in the preceding appendix. 

To motivate such estimators more carefully, first posit that the differences E q t (G\xi)—E q t (G|a^), 
one for each (xi,x'j) pair, are unchanged when one replaces G with some other function gi. So the 
change is equivalent to adding a constant to G, as far as those differences are concerned. This 
means that if agent i used q^ to evaluate its expectation values exactly, then its associated update 
would be unchanged under this replacement. (This is due to cancellation of the equivalent additive 
constant with the change that arises in rji(rii) under the replacement of G(x) with g*(x)). It is 
straight-forward to verify that the set of all g l guaranteed to have this character, regardless of the 
form of q, is the set of difference utilities, g l {x) = G(x) — D J (x_j) for some function D l . G itself is 
the trivial case D l (x_j) = Vxj. 

On the other hand, if we use a difference utility rather than G in our maximum likelihood 
estimator then it is the sample values of P{g l ) that generate Hi, and we use the associated X{- 
indexed vector g x .(ni) rather than G Xi (rii) to update each For well-chosen D l it may typically 
be the case that g l {ni) has a far smaller standard deviation than does G(rii). In particular, if 
the number of coordinates coupled to i through G does not grow as the system does, often such 
difference utility error bars will not grow much with system size, whereas the error bars associated 
with G will grow greatly. Another advantage of difference utilities is that very often the Monte 
Carlo values of a difference utility are far easier to evaluate than are those of G, due to cancellation 
in subtracting D l . 

To make this more precise we can solve for the difference utility with minimal error bar. First 
as a notational matter, extend the definition of P' G by replacing G with (arbitrary) h throughout, 
writing that extended version as P' h . Then assuming no L Xi = 0, we are interested in the g i 
minimizing the data- averaged quadratic error, 

E gi (Err) = f dq^ P(q^) f dnf P(nf\nf g*)^' - f^\n t )\\ 2 , (35) 

where P{q-i) reflects any prior information we might have concerning q-i (e.g., that it is likely that 
the current i i,gl is close to that estimated for the previous block of L steps), and nf is the set of 
values of the private utility contained in n^. (The associated values, nf, are independent of g l 
and q-i and therefore for our purposes can be treated as though they are fixed.) 

Now the components of P' 9 (n-j) (one for each xi) are not independent in general, being coupled 
via fji(ni). To get an integrand that involves only independent variables, we must work with only 
one Xi component at a time. To that end, rewrite the data-averaged quadratic error as 

£ / dq_ % P{q- t ) f dnf P(nf \nf , q. t , g*)[f x f - f x f {m)f 
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where f x f is the qi(xi) component of P' 9 ' . Our results will hold for all q-i, so we ignore the outer 
integral and focus on 

Y.j' hl '! P(4\^,q- t ,g i mf -fifing. (36) 

For any xt the inner integral can be decomposed with the famous bias-variance decomposition into 
a sum of two terms (Duda et al. 2000). 21 The first of the two terms in our sum is the (square of 
the) bias, flf — E g i(f x f ), where 



E 



Cftf(nf)) = Jdnf Pinflnlq^g^fifim) (37) 



is the expectation (over all possible sets of Monte Carlo sample utility values nf) of fxf (ni). The 

bias reflects the systematic trend of our sample-based estimate of f x f to differ from the actual 

fxf • When bias is zero, we know that on average our estimator will return the actual value it's 
estimating. 

The second term in our sum is the variance, 

Var(£f(nf)) 4 Jdnf P(nf |<,^,<f) {f x f (n,) - (f x f )} 2 , (38) 

In general variance reflects how much the value of our estimate "bounces around" if one were to 
resample our Monte Carlo block. In our context, it reflects how much the private utility of agent 
i depends on its own move X{ versus the moves of the other agents. When i's estimator is isolated 
from the moves of the other agents f x f (ni) is mostly independent of the moves of the other agents, 
and therefore of rij. This means that variance is low, and there is a crisp "signal-to- noise" guiding 
i's updates. In this situation the agent can achieve a preset accuracy level in its updating with a 
minimal total number of samples in the Monte Carlo block. 

Plug the general form for a difference utility into the formula for f x f (n») to see that (due to 
cancellation with the fj[rii) term) its n 9 -averaged value is independent of D l . Accordingly bias must 
equal for difference utilities. (In fact, difference utilities are the only utility that is guaranteed to 
have zero bias for all q-i ) So our expected error reduces to the sum over all Xi of the variance for 
each Xi. 



For each one of those variances again use Eq. 33 with G replaced by g % throughout to expand 



fxf (n-i). Since the qi(xi) terms in that expansion are all the same constant independent of Uj, they 
don't contribute to the variance. Accordingly we have 



Var(jjf(nf)) = Var(^(nf) - ^ 



Var 



1 



E ^K))- (39) 



Since nf is fixed and we are doing IID sampling, the two expressions inside the variance function 
are statistically independent. In addition, the variance of a difference of independent variables is 
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the sum of the variances. Accordingly, the sum over all Xi of our variances is (cf. Eq. (36)) 



£Var(£f(nf)) = £Var 

= E 
= E 
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(40) 



where the third equation follows from the second by using the trivial identity Y^ a X^&^ a ^ ? (^) = 

Ea ^(a) Efe^a 1 for an y function F. 

Since for each such x% we are doing L-^-fold IID sampling of an associated fixed distribution, 
the variance for each separate x% is of the form 



dy P(y) 
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for a fixed distribution P(y\,y2, . . ■ , y L;c ) = IIj=i P(Vj)- We can again use the decomposition of a 
variance of a sum into a sum of variances to evaluate this. With the distribution implicit, define 
the single sample variance for the value of any function H(x), for move Xj, as 



This gives 



Collecting terms, we get 



Var(# (x,)) 4 E([i?] 2 |xi) - [E{H\ Xl )} 2 . 
Var(4(nf)) = Var^^)) / 2^ 

|^| _ i Var(^(xi)) 
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(41) 
(42) 

(43) 



Now Var(vl(r)) = (1/2) £4 ta P(ii)P(i2)L4(ii) - A(t 2 )} 2 for any random variable r with dis- 
tribution P(t). Use this to rewrite the sum in Eq. ( |43| ) as 

1 1 E W E ^(xLj^Cx'^b^x^x'.J-^^^x^)] 2 



21*,; 



Bring the sum over Xj inside the other integrals, expand c/ 1 , and drop the overall multiplicative 
constant to get 

Var(i$(n?)) oc E MO^W 



x' ,x". 
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For each x.'_ i and x" 4 , our choice of D l minimizes the sum so long as the difference in the curly 
brackets obeys 



This can be assured by picking 



E 



L, 



(44) 



for all x_j. 

Note that AU minimizes variance of gradient descent updating regardless of the form of q. 
Indeed, being independent of q_ t , it minimizes our original q_ i integral in Eq. (35), regardless of 
the prior P{q_ i ). For the same reason it is optimal if the integral is replaced by a worst-case bound 
over q_.. 
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